{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# $$CatBoost\\ Partial\\ Dependence\\ Plot\\ (PDP)\\ Tutorial$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### A partial dependence plot can show the nature of the relationship between the target and the feature, whether it is linear, monotonic or more complex."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let $x_S$ be the features for which the partial dependence function should be plotted and let $x_C$ be the other features used in the machine learning model $f$.\n",
    "#### The partial function  $f_{x_S}$ is estimated by calculating averages in the training data:\n",
    "\\begin{equation*}\n",
    "f_{x_s}(x_s) = \\frac{1}{n}\\sum\\limits_{i=1}^{n}f(x_s, x_c^{(i)})\n",
    "\\end{equation*}\n",
    "#### where n is the number of instances in the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### In practice, the set of features $S$ usually only contains one feature or a maximum of two, because one feature produces 2D plots and two features produce 3D plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from catboost import CatBoost, Pool, datasets\n",
    "from catboost import datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Let's try to plot PDP on some dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df, _ = datasets.higgs()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = np.array(train_df.drop(0, axis=1))[:1000], np.array(train_df[0])[:1000]\n",
    "pool = Pool(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's train CatBoost:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "cb = CatBoost({'iterations': 50, 'verbose': False, 'random_seed': 42})\n",
    "cb.fit(pool);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's choose one feature and plot its PDP:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "features = 1\n",
    "\n",
    "_ = cb.plot_partial_dependence(pool, features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We also can plot PDP for two features at once. This time instead of simple line plot we get 2d heatmap:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = [1, 2]\n",
    "\n",
    "_ = cb.plot_partial_dependence(pool, features)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
